?

Pre-Processing of 25 scRNA Bone Marrow samples using Seurat

In this worksheet, the Pre-processing of 25 publicly available Bone Marrow samples is described. This includes data acquisition, Quality Control and Cell selection, Normalization and Feature Selection of each individual sample. Afterwards all samples are integrated with the Seurat Anchor Function.

After reducing the complexity with PCA, the integrated data is visualized and evaluated.

1 Preparations and Data

1.0 Required Packages

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(Seurat)
library(ggplot2)

packageVersion("dplyr")
[1] '0.8.99.9003'
packageVersion("Seurat")
[1] '3.1.5'

1.1 Available Data Sets

The read count matrices, genes and barcodes of 25 scRNA Bone Marrow samples were collectivly downloaded from the National Center for Biotechnology Information’s Gene Expression Omnibus (GSE120221), provided by Oetjen et al. 2018.

1.2 Load Data

Each sample was loaded into one Seurat object, stored in a named list (Sample ID and Letter were provided by the directory name of each sample)

data_dir1 <- list.files("~/ExampleSets/BoneMarrowNormal/Data", full.names = TRUE)
names(data_dir1) <- list.files("~/ExampleSets/BoneMarrowNormal/Data")
bmmc_Raw.data <- Read10X(data.dir = data_dir1)
bmmc_Raw <- CreateSeuratObject(counts = bmmc_Raw.data, project = "BoneMarow", min.cells = 3, min.features = 200)
Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

2 Pre-processing and Integration of the Data

2.1 Quality Control

Information about the percentage of mitochondrial RNA per cell are extracted and plotted for each sample.

#Mitochondrial genes selected 
bmmc_Raw[["percent.mt"]] <- PercentageFeatureSet(bmmc_Raw, pattern = "^MT-")
# Visualize as a violin plot
VlnPlot(bmmc_Raw, features =  "percent.mt", ncol = 1, y.max = 15) + NoLegend()
Warning: Removed 124 rows containing non-finite values (stat_ydensity).
Warning: Removed 124 rows containing missing values (geom_point).

The aggregated correlation of amount of RNA and mitochondrial RNA as well as number of Feature is plotted as a further QC

plot1 <- FeatureScatter(bmmc_Raw, feature1 = "nCount_RNA", feature2 = "percent.mt") 
plot2 <- FeatureScatter(bmmc_Raw, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(plot1, plot2))
Warning: CombinePlots is being deprecated. Plots should now be combined using
the patchwork system.

2.2 Filtering of the raw data

Cells with a low amount of features or a high amount of mitochondrial RNA are removed. After evaluation the tresholds of the original publication were applied (MT > 8%, less then 500 RNA-Features).

bmmc_Raw <- subset(bmmc_Raw, subset = nFeature_RNA > 500 & percent.mt < 8)

plot1 <- FeatureScatter(bmmc_Raw, feature1 = "nCount_RNA", feature2 = "percent.mt")  + NoLegend()
plot2 <- FeatureScatter(bmmc_Raw, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") + NoLegend()
CombinePlots(plots = list(plot1, plot2))
Warning: CombinePlots is being deprecated. Plots should now be combined using
the patchwork system.

length(bmmc_Raw@active.ident)
[1] 76645

This number of filtered cells is also reported in the original publication.

2.3 Normalization and Feature selection

Each sample was individually normalized and the top 2000 feautures were selected. Each normalized sample was stored in a list.

bmmc_Raw.list <- SplitObject(bmmc_Raw, split.by = "orig.ident")

bmmc_Raw.list <- lapply(X = bmmc_Raw.list, FUN = function(x) {
    x <- NormalizeData(x, verbose = FALSE)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
})

###2.4 Integration

The 25 samples were combined with the Seurat “FindIntegrationAnchors” and “IntegrateData” method (See Stuart&Butler et al. 2019), using the first 30 Principal components. Remarkly this computational method was not available during the original BMMC study.

bmmc.anchors <- FindIntegrationAnchors(object.list = bmmc_Raw.list, dims = 1:30)
bmmc.combined <- IntegrateData(anchorset = bmmc.anchors, dims = 1:30)
Warning: Adding a command log without an assay associated with it

##3 Dimension reduction

###3.1 Scaling

The integrated dataset is scaled (zero-centered and Scaled)

DefaultAssay(object = bmmc.combined) <- "integrated"
bmmc.combined <- ScaleData(object = bmmc.combined, verbose = FALSE)

###3.2 PCA The pincipal components of the integrated data set are calculated and the top genes contributing for the first two PCs are presented.

bmmc.combined <- RunPCA(bmmc.combined, verbose = FALSE)
VizDimLoadings(bmmc.combined, dims = 1:2, reduction = "pca")

The PCA dimension reduction is shown for the first two PCs:

DimPlot(bmmc.combined, reduction = "pca")

3.3 Dimension determination

The PC components which explained most of the variance are selected based on the ElbowPlot shown below. In this case PCs after PC 20, each PC explains less than 2% of the variance.

ElbowPlot(bmmc.combined, ndims = 50) + geom_hline(yintercept=2)

4 Visualization with UMAP and TSNE

###4.1 UMAP

UMAP is calculated based on the first 20 PC dimensions

bmmc.combined <- RunUMAP(bmmc.combined,reduction = "pca", dims = 1:20)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
16:03:42 UMAP embedding parameters a = 0.9922 b = 1.112
16:03:42 Read 76645 rows and found 20 numeric columns
16:03:42 Using Annoy for neighbor search, n_neighbors = 30
16:03:42 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:04:00 Writing NN index file to temp file /tmp/RtmpBQd4YX/filef567352fdfa7
16:04:00 Searching Annoy index using 1 thread, search_k = 3000
16:04:45 Annoy recall = 100%
16:04:45 Commencing smooth kNN distance calibration using 1 thread
16:04:50 Initializing from normalized Laplacian + noise
16:05:20 Commencing optimization for 200 epochs, with 3839614 positive edges
16:07:27 Optimization finished
DimPlot(bmmc.combined, reduction = "umap")

###4.2 TSNE TSNE is calculated the same way.

bmmc.combined <- RunTSNE(bmmc.combined, dims = 1:20)
DimPlot(bmmc.combined, reduction = "tsne")

5 Storing

The object is stored for further processing

saveRDS(bmmc.combined, file = "./StoredRObj/bmmc_PreProc_Ref2.rds")
sessioninfo::package_info()
 package      * version     date       lib source                          
 ape            5.3         2019-03-17 [2] CRAN (R 3.6.0)                  
 assertthat     0.2.1       2019-03-21 [2] CRAN (R 3.6.0)                  
 cli            2.0.2       2020-02-28 [2] CRAN (R 3.6.0)                  
 cluster        2.1.0       2019-06-19 [2] CRAN (R 3.6.0)                  
 codetools      0.2-16      2018-12-24 [3] CRAN (R 3.6.0)                  
 colorspace     1.4-1       2019-03-18 [2] CRAN (R 3.6.0)                  
 cowplot        1.0.0       2019-07-11 [2] CRAN (R 3.6.0)                  
 crayon         1.3.4       2017-09-16 [2] CRAN (R 3.6.0)                  
 data.table     1.12.8      2019-12-09 [2] CRAN (R 3.6.0)                  
 digest         0.6.25      2020-02-23 [2] CRAN (R 3.6.0)                  
 dplyr        * 0.8.99.9003 2020-05-12 [2] Github (tidyverse/dplyr@75a29ef)
 ellipsis       0.3.0       2019-09-20 [2] CRAN (R 3.6.0)                  
 evaluate       0.14        2019-05-28 [2] CRAN (R 3.6.0)                  
 fansi          0.4.1       2020-01-08 [2] CRAN (R 3.6.0)                  
 farver         2.0.3       2020-01-16 [2] CRAN (R 3.6.0)                  
 fitdistrplus   1.0-14      2019-01-23 [2] CRAN (R 3.6.0)                  
 future         1.17.0      2020-04-18 [2] CRAN (R 3.6.0)                  
 future.apply   1.5.0       2020-04-17 [2] CRAN (R 3.6.0)                  
 generics       0.0.2       2018-11-29 [2] CRAN (R 3.6.0)                  
 ggplot2      * 3.3.0       2020-03-05 [2] CRAN (R 3.6.0)                  
 ggrepel        0.8.2       2020-03-08 [2] CRAN (R 3.6.0)                  
 ggridges       0.5.2       2020-01-12 [2] CRAN (R 3.6.0)                  
 globals        0.12.5      2019-12-07 [2] CRAN (R 3.6.0)                  
 glue           1.4.0       2020-04-03 [2] CRAN (R 3.6.0)                  
 gridExtra      2.3         2017-09-09 [2] CRAN (R 3.6.0)                  
 gtable         0.3.0       2019-03-25 [2] CRAN (R 3.6.0)                  
 htmltools      0.4.0       2019-10-04 [2] CRAN (R 3.6.0)                  
 htmlwidgets    1.5.1       2019-10-08 [2] CRAN (R 3.6.0)                  
 httr           1.4.1       2019-08-05 [2] CRAN (R 3.6.0)                  
 ica            1.0-2       2018-05-24 [2] CRAN (R 3.6.0)                  
 igraph         1.2.5       2020-03-19 [2] CRAN (R 3.6.0)                  
 irlba          2.3.3       2019-02-05 [2] CRAN (R 3.6.0)                  
 jsonlite       1.6.1       2020-02-02 [2] CRAN (R 3.6.0)                  
 KernSmooth     2.23-15     2015-06-29 [3] CRAN (R 3.6.0)                  
 knitr          1.28        2020-02-06 [2] CRAN (R 3.6.0)                  
 labeling       0.3         2014-08-23 [2] CRAN (R 3.6.0)                  
 lattice        0.20-38     2018-11-04 [3] CRAN (R 3.6.0)                  
 lazyeval       0.2.2       2019-03-15 [2] CRAN (R 3.6.0)                  
 leiden         0.3.3       2020-02-04 [2] CRAN (R 3.6.0)                  
 lifecycle      0.2.0       2020-03-06 [2] CRAN (R 3.6.0)                  
 listenv        0.8.0       2019-12-05 [2] CRAN (R 3.6.0)                  
 lmtest         0.9-37      2019-04-30 [2] CRAN (R 3.6.0)                  
 lsei           1.2-0.1     2020-05-06 [2] CRAN (R 3.6.0)                  
 magrittr       1.5         2014-11-22 [2] CRAN (R 3.6.0)                  
 MASS           7.3-51.6    2020-04-26 [2] CRAN (R 3.6.0)                  
 Matrix         1.2-17      2019-03-22 [3] CRAN (R 3.6.0)                  
 munsell        0.5.0       2018-06-12 [2] CRAN (R 3.6.0)                  
 nlme           3.1-139     2019-04-09 [3] CRAN (R 3.6.0)                  
 npsurv         0.4-0.1     2020-05-06 [2] CRAN (R 3.6.0)                  
 patchwork      1.0.0       2019-12-01 [2] CRAN (R 3.6.0)                  
 pbapply        1.4-2       2019-08-31 [2] CRAN (R 3.6.0)                  
 pillar         1.4.4       2020-05-05 [2] CRAN (R 3.6.0)                  
 pkgconfig      2.0.3       2019-09-22 [2] CRAN (R 3.6.0)                  
 plotly         4.9.2.1     2020-04-04 [2] CRAN (R 3.6.0)                  
 plyr           1.8.6       2020-03-03 [2] CRAN (R 3.6.0)                  
 png            0.1-7       2013-12-03 [2] CRAN (R 3.6.0)                  
 purrr          0.3.4       2020-04-17 [2] CRAN (R 3.6.0)                  
 R6             2.4.1       2019-11-12 [2] CRAN (R 3.6.0)                  
 RANN           2.6.1       2019-01-08 [2] CRAN (R 3.6.0)                  
 rappdirs       0.3.1       2016-03-28 [2] CRAN (R 3.6.0)                  
 RColorBrewer   1.1-2       2014-12-07 [2] CRAN (R 3.6.0)                  
 Rcpp           1.0.4.6     2020-04-09 [2] CRAN (R 3.6.0)                  
 RcppAnnoy      0.0.16      2020-03-08 [2] CRAN (R 3.6.0)                  
 reshape2       1.4.4       2020-04-09 [2] CRAN (R 3.6.0)                  
 reticulate     1.15        2020-04-02 [2] CRAN (R 3.6.0)                  
 rlang          0.4.6       2020-05-02 [2] CRAN (R 3.6.0)                  
 rmarkdown      2.1         2020-01-20 [2] CRAN (R 3.6.0)                  
 ROCR           1.0-11      2020-05-02 [2] CRAN (R 3.6.0)                  
 RSpectra       0.16-0      2019-12-01 [2] CRAN (R 3.6.0)                  
 rsvd           1.0.3       2020-02-17 [2] CRAN (R 3.6.0)                  
 Rtsne          0.15        2018-11-10 [2] CRAN (R 3.6.0)                  
 scales         1.1.1       2020-05-11 [2] CRAN (R 3.6.0)                  
 sctransform    0.2.1       2019-12-17 [2] CRAN (R 3.6.0)                  
 sessioninfo    1.1.1       2018-11-05 [2] CRAN (R 3.6.0)                  
 Seurat       * 3.1.5       2020-04-16 [2] CRAN (R 3.6.0)                  
 stringi        1.4.6       2020-02-17 [2] CRAN (R 3.6.0)                  
 stringr        1.4.0       2019-02-10 [2] CRAN (R 3.6.0)                  
 survival       3.1-12      2020-04-10 [2] CRAN (R 3.6.0)                  
 tibble         3.0.1       2020-04-20 [2] CRAN (R 3.6.0)                  
 tidyr          1.0.3       2020-05-07 [2] CRAN (R 3.6.0)                  
 tidyselect     1.1.0       2020-05-11 [2] CRAN (R 3.6.0)                  
 tsne           0.1-3       2016-07-15 [2] CRAN (R 3.6.0)                  
 uwot           0.1.8       2020-03-16 [2] CRAN (R 3.6.0)                  
 vctrs          0.3.0       2020-05-11 [2] CRAN (R 3.6.0)                  
 viridisLite    0.3.0       2018-02-01 [2] CRAN (R 3.6.0)                  
 withr          2.2.0       2020-04-20 [2] CRAN (R 3.6.0)                  
 xfun           0.13        2020-04-13 [2] CRAN (R 3.6.0)                  
 yaml           2.2.1       2020-02-01 [2] CRAN (R 3.6.0)                  
 zoo            1.8-8       2020-05-02 [2] CRAN (R 3.6.0)                  

[1] /home/david.mentrup/R/x86_64-redhat-linux-gnu-library/3.4
[2] /home/common/R
[3] /usr/lib64/R/library
[4] /usr/share/R/library